From scripts to pipelines with targets


https://papsti.github.io/talks/2023-10-19_targets.html

Irena Papst
Senior Scientist
Public Health Agency of Canada

Pipelining is the process of writing down a recipe for outputs where all dependencies are stated explicitly.

Isn’t that just a script??

No!

“Disease X is picking up, can you forecast it for the next few weeks?”




“Can you clean this data set for the team?”


“Can you make some plots for the presentation next Tuesday?”


“Can you make the tables and figures for our paper?”

“Disease X is picking up, can you forecast it for the next few weeks?”

A simple script

library(readr); library(dplyr); library(ggplot2)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# simulate
sim = make_forecast(data)

# plot results
plot_forecast(sim)

Great! Now we need multiple forecast scenarios

A slightly-more-complicated script

library(readr); library(dplyr); library(ggplot2)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# specify parameters
for(scenario in c("A", "B", "C")){
  # simulate
  sim = make_forecast(
    data, 
    scenario = scenario
  )
  # plot results
  plot_forecast(sim)
}

Parallelization to the rescue?

library(readr); library(dplyr); library(ggplot2)
library(doParallel); cl <- makeCluster(4); registerDoParallel(cl)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# specify parameters
foreach(scenario = c("A", "B", "C")) %dopar% {
  # simulate
  sim = make_forecast(
    data, 
    scenario = scenario
  )
  # plot results
  plot_forecast(sim)
}

What if…

  • I only want to re-run certain scenarios?
  • I’m running into errors for one scenario and need to debug?
  • I step away for a meeting and forget which scenario results are up-to-date?

That’s a lot to manage!

Enter targets

“The targets package is a Make-like pipeline tool for statistics and data science in R.”

the targets user manual

Just use Make!!!

I don’t wanna!

In a government and/or corporate setting, an R package can be easier

  • to install
  • to set up for colleagues (thanks to renv <3)

Show & tell

library(readr); library(dplyr); library(ggplot2)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# simulate
sim = make_forecast(data)

# plot results
plot_forecast(sim)

Show & tell

Before

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

After

# as a target
tar_target(
  # name of the target
  data,
  # recipe for the target
  (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
  )
)

Make a _targets.R file

library(targets)
use_targets()

From script to pipeline

library(targets)
source("R/make_forecast.R"); source("R/plot_forecast.R")
tar_option_set(packages = c("readr", "dplyr", "ggplot2", "lubridate"))

# pipeline
list(
  # read & clean data
  tar_target(
    data,
    (read_csv("data/case-counts.csv")
      |> filter(date >= "2023-01-01")
    )
  ),
  # simulate
  tar_target(
    sim,
    make_forecast(data)
  ),
  # plot results
  tar_target(
    plot,
    plot_forecast(sim)
  )
)

The free lunch

Easy to make

tar_make(plot)
▶ start target data
● built target data [0.304 seconds]
▶ start target sim
● built target sim [0.003 seconds]
▶ start target plot
● built target plot [0.008 seconds]
▶ end pipeline [0.401 seconds]

Easy to read

tar_read(plot)
tar_load(plot)
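The difference between the two (a quick sketch of standard targets behaviour):

```r
# tar_read() returns the stored value, so you can use it inline:
p <- tar_read(plot)   # assign the cached plot object to p

# tar_load() assigns the value into your environment under the
# target's own name -- handy when loading several targets at once:
tar_load(plot)          # creates `plot` in the global environment
tar_load(c(data, sim))  # load multiple targets in one call
```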

Easy to visualize the pipeline

tar_visnetwork()

Only make what you need

tar_visnetwork()
tar_make()
✔ skip target data
✔ skip target sim
▶ start target plot
● built target plot [0.01 seconds]
▶ end pipeline [0.295 seconds]

Only make what you need

tar_visnetwork()

The recipe is laid out

tar_manifest()
# A tibble: 3 × 2
  name  command                                                                 
  <chr> <chr>                                                                   
1 data  "(filter(read_csv(\"data/case-counts.csv\", show_col_types = FALSE), \n…
2 sim   "make_forecast(data)"                                                   
3 plot  "plot_forecast(sim)"                                                    

And so much more!

  • Branching: easily repeat sections of the pipeline
  • Distributed computing: parallelize easily
  • tarchetypes: make target factories for common tasks
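For distributed computing, one option is to hand targets a crew controller; a minimal sketch, assuming the crew package is installed:

```r
library(targets)
library(crew)

tar_option_set(
  packages = c("readr", "dplyr", "ggplot2"),
  # launch two local worker processes; targets dispatches
  # outdated targets to them automatically during tar_make()
  controller = crew_controller_local(workers = 2)
)
```

The pipeline definition itself is unchanged — parallelism becomes a configuration detail rather than a rewrite.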

Learn from my mistakes

Not everything has to be a target

  • Start with broad strokes, drill down into smaller pieces as needed
    • Larger targets = bigger speed-ups when skipped
    • Smaller targets = more control over what is skipped and what isn’t
  • Ask yourself:
    • Might this target get updated independently of other targets?
    • Is this something I’ll want to look at down the line?

Don’t be afraid to run the pipeline

tar_make(sim)
✔ skip target data
▶ start target sim
✖ error target sim
▶ end pipeline [0.278 seconds]
Error:
! Error running targets::tar_make()
  Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
  Debugging guide: https://books.ropensci.org/targets/debugging.html
  How to ask for help: https://books.ropensci.org/targets/help.html
  Last error: object 'x' not found

Don’t do this!

source("R/make_forecast.R")
debug(make_forecast)
tar_load(data)
make_forecast(data)

Do this

# insert browser() statement 
# into make_forecast()
tar_make(sim,
         callr_function = NULL)
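Concretely, that workflow might look like this (a sketch using the talk's example function; its body is assumed):

```r
# In R/make_forecast.R, temporarily drop a browser() call
# just before the failing code:
make_forecast <- function(data) {
  browser()  # pipeline pauses here with the real target inputs in scope
  # ... forecasting code ...
}

# callr_function = NULL keeps targets from launching a fresh
# background process, so the browser prompt actually appears
# in your interactive session:
tar_make(sim, callr_function = NULL)
```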

Be explicit

Don’t do this!

tar_target(
  data,
  (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
  )
)

Do this

tar_target(
  data.file,
  "data/case-counts.csv",
  format = "file"
) # track the file!!!
tar_target(
  data,
  (read_csv(data.file)
   |> filter(date >= "2023-01-01")
  )
)

Dynamic branching > static branching

  • Static branching
    • Targets generated before pipeline is run
    • Clearly named targets generated (sim_ON)
    • More annoying to aggregate
  • Dynamic branching
    • Targets generated at run-time
    • Cryptic names (sim_3e0e0255)
    • Automagic
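The earlier multi-scenario forecast is a natural fit for dynamic branching; a sketch (hypothetical adaptation of the talk's pipeline, assuming make_forecast() takes a scenario argument):

```r
list(
  # scenarios as a target, so branches regenerate if the list changes
  tar_target(scenario, c("A", "B", "C")),
  tar_target(
    sim,
    make_forecast(data, scenario = scenario),
    pattern = map(scenario)  # one branch per scenario, created at run-time
  ),
  tar_target(
    plot,
    plot_forecast(sim),
    pattern = map(sim)  # one plot per simulation branch
  )
)
```

Only the branches whose inputs changed get rebuilt — exactly the "re-run certain scenarios" problem from earlier.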

Final thoughts

You don’t always need a pipeline…

  • Short, one-off task? Maybe write a script
  • Multi-step process that you'll have to run again, or hand off to a colleague? Maybe write a pipeline

…but you can still set yourself up for success

  • Adopt a file structure compatible with targets pipelines (and R packages)
    • Define functions in R/
  • Document functions as if you’re going to package them
    • Use roxygen
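For example, documenting the talk's forecasting function with roxygen comments (the signature and fields here are hypothetical):

```r
#' Forecast case counts
#'
#' @param data A data frame of cleaned case counts.
#' @param scenario A scenario label, e.g. "A", "B", or "C".
#'
#' @return A data frame of simulated forecasts.
make_forecast <- function(data, scenario = "A") {
  # ... forecasting code ...
}
```

If the project later graduates to a package, the documentation is already written.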

Getting started

Thank you!

https://papsti.github.io/talks/2023-10-19_targets.html